Realtime user ratings as a strategy for combatting misinformation: an experimental study

Because fact-checking takes time, verdicts are usually reached after a message has gone viral, so interventions can have only limited effect. A new approach recently proposed in scholarship and piloted on online platforms is to harness the wisdom of the crowd by enabling recipients of an online message to attach veracity assessments to it. The intention is to allow poor initial crowd reception to temper belief in and further spread of misinformation. We study this approach by letting 4,000 subjects in 80 experimental bipartisan communities sequentially rate the veracity of informational messages. We find that in well-mixed communities, the public display of earlier veracity ratings indeed enhances the correct classification of true and false messages by subsequent users. However, crowd intelligence backfires when false information is sequentially rated in ideologically segregated communities. This happens because early raters' ideological bias, when aligned with a message, pulls later raters' assessments away from the truth. These results suggest that network segregation poses an important problem for community misinformation detection systems and must be accounted for in the design of such systems.

After clicking on our advertisement on MTurk or Prolific, subjects were routed to a screener study in which they were asked a standard question about ideological self-identification 1. Subjects identifying as moderate were remunerated $0.15 and excluded from the study. Remaining subjects were instructed on their task, indicated their informed consent, and were asked the ideological self-identification question a second time. Subjects whose ideological leaning did not match their self-reported ideology in the initial screener were remunerated $0.15 and excluded. The self-identification question, informed consent form, and experimental instructions are presented in Supplementary Fig. S1. Subjects who completed their task successfully earned $1.50. When subjects entered the experiment, we informed them that they would do their ratings in 'groups' and, depending on the condition, whether or not they could see others' ratings. To mimic an online social media platform, where no financial incentives for particular individual behaviors exist, we paid subjects flat fees rather than paying them for the accuracy with which they classified true and false messages.
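The two-step consistency screen can be summarized as follows. This is a minimal sketch of the eligibility logic described above; the function and payment constants are our own illustration, not the study's actual implementation.

```python
# Minimal sketch of the two-step ideology screen described above.
# Function and variable names are illustrative, not the study's actual code.

SCREENER_FEE = 0.15    # paid to subjects excluded during screening
COMPLETION_FEE = 1.50  # flat fee on completion, independent of rating accuracy

def screen_subject(first_answer: str, second_answer: str) -> tuple[bool, float]:
    """Return (eligible, payment) for a prospective subject.

    first_answer  -- ideological self-identification in the screener study
    second_answer -- the same question asked again after instructions/consent
    """
    if first_answer == "moderate":
        return False, SCREENER_FEE   # moderates are excluded up front
    if second_answer != first_answer:
        return False, SCREENER_FEE   # inconsistent leaning -> excluded
    return True, COMPLETION_FEE      # eligible; flat fee paid on completion
```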

Data Quality
Apart from checking for consistent ideological leaning, we took further measures to ensure high data quality. First, we excluded subjects who did not read messages carefully: subjects were excluded from further participation if, on three occasions, they made a rating decision after having seen a message for less than three seconds. Second, subjects who took more than 10 minutes to finish their task were excluded from the study. Third, we excluded subjects who failed at least one of our three attention check messages (example: "Europe is in the southern hemisphere." True / False). Subjects who failed one of these quality checks were paid only the remuneration for our screener study, $0.15. Excluded subjects were replaced with new subjects, and their rating choices were not counted in the rating signals displayed to subsequent subjects.
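The three filters can be expressed compactly. The thresholds below are the ones reported in the text; the data structures and names are our assumptions, a sketch rather than the study's implementation.

```python
# Hypothetical sketch of the three quality filters described above.
from dataclasses import dataclass, field

@dataclass
class Session:
    view_times: list = field(default_factory=list)          # seconds each message was visible
    total_minutes: float = 0.0                               # time to finish the whole task
    attention_answers: dict = field(default_factory=dict)    # check question -> given answer

# answer key for the attention checks (one example from the text)
ATTENTION_KEY = {"Europe is in the southern hemisphere.": False}

def passes_quality_checks(s: Session) -> bool:
    # 1) fewer than three ratings made after viewing a message < 3 seconds
    if sum(t < 3.0 for t in s.view_times) >= 3:
        return False
    # 2) entire task finished within 10 minutes
    if s.total_minutes > 10.0:
        return False
    # 3) every attention check answered correctly
    return all(s.attention_answers.get(q) == a for q, a in ATTENTION_KEY.items())
```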
Of the 9,512 subjects who responded to our study advertisement and went through the screener tasks, 3,284 (34.5%) were excluded because they identified as moderates, reported inconsistent ideological leanings, or had an active MTurk account in parallel to their participation on Prolific. 684 subjects (7.2%) chose not to proceed with the experiment after the pre-screener or refused consent. 568 subjects (6.0%) were unable to participate because they arrived at a time when no spots for subjects of their ideological leaning were available. 975 subjects (10.3%) failed at least one of our quality checks: 511 subjects (5.4%) had unreasonably short response times, 387 (4.1%) failed an attention check question, and 77 (0.8%) did not finish within 10 minutes. Overall, participation rates were similar across samples. On MTurk, 41.4% of all individuals who responded to our study advertisement (2,000 out of 4,834) successfully finished the study; on Prolific, it was 42.8% (2,000 out of 4,678).

Message Selection
Prior to the experiment, messages were calibrated through pretesting. First, independent evaluations by 350 conservative and 350 liberal subjects ensured that liberal messages were more likely to be perceived as true by liberal subjects, and conservative messages more likely to be perceived as true by conservative subjects. Second, we ensured during the pretest phase that subjects were more likely to make a correct rather than an incorrect evaluation of a message's veracity: the median rating of a message always had to reflect the actual veracity of the message. Put differently, each message had to have an average difficulty below 0.5 and a bias greater than zero. Both are scope conditions of this study: the wisdom of crowds requires that more than 50 percent of the population make a correct rating decision independently 18,20, and there must be a difference in message difficulty between aligned and misaligned subjects for segregated rating orders to have any effect at all. Out of an original set of 144 messages used for pretesting, we chose a subset of 20 messages, rather than a larger set, to prevent subjects from becoming tired or inattentive after too many messages. Supplementary Table S1 presents an overview of the message set used in the experiment.
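The message parameters used throughout (average difficulty d̄, aligned-rater difficulty d_align, and bias) could be estimated from pretest ratings as sketched below. Column names and this exact estimator are our assumptions, not the authors' code; the sketch only instantiates the definitions given above.

```python
# Illustrative computation of the pretest message parameters described above.
import pandas as pd

def select_messages(pretest: pd.DataFrame) -> pd.DataFrame:
    """pretest columns (assumed): message_id, rated_true (bool),
    is_true (bool), aligned (bool: rater ideology matches message slant)."""
    df = pretest.assign(
        wrong=(pretest["rated_true"] != pretest["is_true"]).astype(float),
        rated_true_f=pretest["rated_true"].astype(float),
    )
    # d-bar: share of incorrect ratings across all pretest raters
    params = df.groupby("message_id")["wrong"].mean().rename("avg_difficulty").to_frame()
    # d_align: difficulty among raters whose ideology matches the message
    params["d_align"] = df[df["aligned"]].groupby("message_id")["wrong"].mean()
    # bias: how much more often aligned raters call the message true
    rated = df.pivot_table(index="message_id", columns="aligned",
                           values="rated_true_f", aggfunc="mean")
    params["bias"] = rated[True] - rated[False]
    # scope conditions: a majority rates correctly, and bias is positive
    return params[(params["avg_difficulty"] < 0.5) & (params["bias"] > 0.0)]
```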

Subjects' Rating Behavior
Consistent with the scope conditions of this study, subjects were reasonably able to tell true from false messages in the independent sequences of the experiment. On average, subjects in the independent condition thought 66.7% of true messages to be true, while this was the case for only 40.1% of false messages (paired t-test: t = 34.5, p < 0.001, N = 500). At the same time, subjects in the independent condition found 66.2% of messages that aligned with their ideology to be true, but only 40.5% of misaligned messages (paired t-test: t = 28.0, p < 0.001, N = 500). This shows that cognitive biases did play a role: messages supporting one's own viewpoint were more often thought to be true, independent of the actual veracity of a message. It is noteworthy that liberals were especially inclined to rate true liberal messages as true (85.7% of messages in this category), while they rarely rated false conservative messages as true (19.1%). Classifying false conservative messages as false and true liberal messages as true made liberals better at making correct rating decisions overall (average liberal subject: 67% correct vs. conservative subject: 59%; t-test: t = 11.3, p < 0.001, N = 500). At the same time, liberal subjects were more biased than conservatives: the difference between finding aligned versus misaligned messages true was larger among liberals (30.1 pp.) than among conservatives (20.1 pp.; t-test: t = 5.5, p < 0.001, N = 500). An overview of subject behavior by message veracity and message ideology is presented in Supplementary Table S2. Overall, independent subjects rated 63.5% of their messages correctly, suggesting that ability in the population was indeed above 0.5 for at least a sizeable portion of all messages. Subjects in the independent condition of the Prolific sample were slightly better at making correct rating decisions than those in the MTurk sample (64.8% versus 61.9%; t-test: t = 3.8, p < 0.001, N = 500). Simultaneously, the difference in finding aligned versus misaligned messages to be true (bias) was higher among Prolific subjects than among MTurk subjects (Prolific: 31.5 pp., MTurk: 19.9 pp.; t-test: t = 6.4, p < 0.001, N = 500). Both differences are consistent with prior studies finding subject quality to be higher on Prolific 51,52: subjects may have read questions more closely and considered them more carefully. Indeed, a much lower number of quality check failures (94 subjects on Prolific versus 881 on MTurk) suggests that fewer subjects showed satisficing behavior on Prolific. Because subject behavior differed slightly between the Prolific and MTurk datasets, the Supplementary Robustness of Findings section reports additional analyses in which the underlying message parameters were constructed for each dataset separately.

[Table notes (tables not shown): The regression models include an individual-level and a sequence-level variance term, and standard errors are clustered at the individual level. The first table includes only decisions for true messages with d_align < 0.5 and d̄ < 0.5; the second only decisions for false messages with d_align > 0.5 and d̄ < 0.5. * p < 0.05; ** p < 0.01; *** p < 0.001 (two-sided).]
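The paired comparisons above contrast, within each subject, the share of one message category rated true against another. A minimal sketch with scipy follows; the synthetic placeholder data and variable names are ours, not the study's data.

```python
# Sketch of the paired t-tests reported above, on synthetic placeholder data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500  # subjects in the independent condition

# per-subject share of true messages rated "true" vs. the same share for
# false messages (synthetic values centered on the reported means)
share_true_rated_true = rng.normal(0.667, 0.15, n).clip(0, 1)
share_false_rated_true = rng.normal(0.401, 0.15, n).clip(0, 1)

# paired t-test: each subject contributes one value to each condition
t, p = stats.ttest_rel(share_true_rated_true, share_false_rated_true)
print(f"paired t = {t:.1f}, p = {p:.3g}, N = {n}")
```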

Robustness of Findings
We report two analyses of the robustness of our findings. First, since subjects from the MTurk dataset behaved slightly differently from subjects from the Prolific dataset (see section S4), we treat subjects from each dataset as different populations and compute message parameters for each dataset separately. Second, we take into account that message parameters are estimates rather than true values, and include only messages whose 95% confidence interval of average difficulty d̄ did not overlap with the 0.5 threshold. For Hypothesis 2, we retained only messages for which the upper confidence bound of d_align was below 0.5, and for Hypotheses 3 and 4, only messages for which the lower confidence bound of d_align was above 0.5. Note that since parameters were computed separately per dataset, sample sizes were smaller and standard errors larger. Supplementary Table S4 presents an overview of how many messages were selected for each hypothesis and dataset. We then computed the fraction of correct rating decisions per sequence and conducted the same analyses as in the main results section. Consistent with Hypothesis 1, broadcasting ratings in integrated sequences led to an improvement in rating performance. In integrated groups, the fraction of correct rating decisions was higher than in independent sequences, both among liberal messages (72.7% versus 68.5%; ATE = 4.2%, p < 0.001, N = 40) and among conservative messages (69.3% versus 63.7%; ATE = 5.6%, p < 0.001, N = 40). The additional results also support Hypothesis 2: subjects were better at making correct rating decisions in groups where those who aligned with the connotation of a message rated first. In segregated groups where liberals rated first, the fraction of correct rating decisions rose by 5.7 percentage points compared to independent groups (independent 71.4% versus liberal-first 77.0%; two-sided randomization test: ATE = 5.7%, p < 0.001, N = 40). In segregated groups where conservatives rated first, average accuracy increased by 9.0 percentage points (independent 63.9% versus conservative-first 72.9%; two-sided randomization test: ATE = 9.0%, p < 0.001, N = 40).
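A two-sided randomization test of the kind reported above compares the observed difference in sequence-level accuracy against differences obtained by randomly reshuffling sequences across conditions. The sketch below shows this logic on made-up accuracies; the resampling scheme and names are our assumptions, not necessarily the authors' exact procedure.

```python
# Minimal sketch of a two-sided randomization (permutation) test for the ATE
# on sequence-level accuracy, using synthetic data.
import numpy as np

def randomization_test(treated, control, n_perm=100_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([treated, control])
    n_t = len(treated)
    observed = treated.mean() - control.mean()
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = perm[:n_t].mean() - perm[n_t:].mean()
        count += abs(diff) >= abs(observed)
    return observed, count / n_perm  # ATE estimate and two-sided p-value

# usage with made-up per-sequence fractions of correct ratings
rng = np.random.default_rng(1)
liberal_first = rng.normal(0.77, 0.05, 20)
independent = rng.normal(0.714, 0.05, 20)
ate, p = randomization_test(liberal_first, independent)
print(f"ATE = {ate:.3f}, p = {p:.3f}")
```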

Unlike the results in the main text, the fraction of correct ratings did not decrease significantly in segregated groups when d_align was above 0.5. However, the results indicated effects in the same direction and of similar strength: in segregated groups where liberals rated first, the fraction of correct ratings decreased by 6.6 percentage points (independent 55.2% versus liberal-first 48.6%; two-sided randomization test: ATE = 6.6%, p = 0.126, N = 20).
In segregated groups where conservatives rated first, the fraction of correct ratings sank by 3.4 percentage points compared to independent sequences (independent 57.8% versus conservative-first 54.4%; two-sided randomization test: ATE = 3.4%, p = 0.108, N = 40).
Consistent with the main results for Hypothesis 4, a multilevel logit regression for true messages showed significantly increasing rating performance over conservative-aligned individuals' positions in the rating sequence (β = .019, p = .01), but no increasing performance for liberal-aligned subjects. No decreasing performance among misaligned conservative or liberal subjects was found. Finally, in line with the main results for Hypothesis 5, there was no significant decrease in rating performance over aligned individuals' positions in rating groups, nor an increase in performance for misaligned subjects. We conclude that the results of our robustness analyses are similar to those in the main text: Hypotheses 1, 2 and 4 allowed for identical conclusions, and Hypothesis 3 showed effects of similar size that did not reach significance. The lack of significance for Hypothesis 3 can likely be attributed to the fact that fewer messages were considered in the robustness analysis, and hence that less statistical power was available.
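A position-effect specification of this kind could be approximated as sketched below. Note that this sketch substitutes a plain logit with individual-clustered standard errors for the paper's full multilevel model (which also includes individual- and sequence-level variance terms), and all column names are our assumptions.

```python
# Hedged approximation of the position-effect logit described above:
# correctness regressed on position in the rating sequence, with
# individual-clustered standard errors. Column names are assumed.
import pandas as pd
import statsmodels.formula.api as smf

def fit_position_model(df: pd.DataFrame):
    """df columns (assumed): correct (0/1), position (1..n within a
    sequence), aligned (0/1), subject_id."""
    model = smf.logit("correct ~ position * aligned", data=df)
    return model.fit(cov_type="cluster",
                     cov_kwds={"groups": df["subject_id"]},
                     disp=False)

# result.params["position"] then corresponds to the position slope (beta)
# for misaligned subjects, and position + position:aligned to aligned ones.
```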